Introduction

Why sports analytics?

Sports analytics is one of the most in-demand topics in statistical modeling these days. Professional sports teams invest large sums in their rosters each year. They need to evaluate carefully whom they choose and whom they need at each position, and which playing strategies (e.g., offensive or defensive) work better for their team, in order to spend less but achieve more. Team managers use statistical models to determine what characteristics make a college player likely to become a strong professional, along with many other crucial questions and decisions they face each season. Sports analytics combines historical events, statistical analysis, simulation, and modeling of a team's individual players, tactics, and opponents; when properly analyzed and applied, it can provide a competitive advantage to a team or even to individual players. By analyzing these data, sports analytics informs players and coaches and aids decision making before, during, and after games.

According to Wikipedia, the term “sports analytics” spread through mainstream sports culture following the release of the 2011 movie Moneyball. The 2002 Oakland Athletics made history by winning 20 consecutive games between August 13 and September 4, 2002. Much of the Athletics' success that season is credited to a Harvard economics graduate who joined the team in 1999 and implemented statistical analysis of baseball data to assess and purchase players. Sports analytics helps teams make better-informed decisions, which improves the performance of a team through better resource management. On top of that, analytics can save, or even make, money for a club by investing wisely in players, finding unrecognized talent, or finding weaknesses in an opposing player or team. Hence sports analytics is rapidly becoming an essential element in the success of sports teams.

Statement of personal interest

In this project we decided to enter the interesting and fast-evolving field of data science. Our objective is to statistically investigate the critical features of baseball teams and players, identify the significant ones, and predict the performance of a baseball team and/or of an individual player within a stated confidence level. Based on statistical data, our main intention is to answer the following questions:

  • Predict a player’s batting average, to assess the player’s quality for a position.
  • Statistically predict whether a team makes the playoffs.
  • What factors make a team win more games?
  • How does a team score more runs?
  • Develop a predictive model.

Some definitions: The batting average of a baseball player is the ratio of his number of hits to his number of at-bats (opportunities to hit). Professional baseball players typically have a batting average somewhere between 0.200 and 0.300. A player’s batting average in a given year can be predicted from his cumulative history of performance, and it can be considered a critical feature of a professional player. On the other hand, features such as runs allowed, runs scored, on-base percentage, and slugging percentage are the main features for evaluating the performance of a baseball team over a season. According to statistical analysis, a baseball team requires roughly 90 to 100 wins to make the playoffs. A team typically needs to score 800 to 820 runs and allow 600 to 650 runs to reach the postseason.
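The definitions above can be sketched directly in R; the season line below uses hypothetical counting stats, chosen only to illustrate the formulas.

```r
# Illustrative season line (hypothetical numbers, not taken from the datasets)
H  = 172   # hits
AB = 590   # at-bats
OB = 234   # times on base (H + BB + HBP)
PA = 664   # plate appearances
TB = 334   # total bases

AVG = H / AB    # batting average
OBP = OB / PA   # on-base percentage
SLG = TB / AB   # slugging percentage

round(c(AVG = AVG, OBP = OBP, SLG = SLG), 3)   # 0.292, 0.352, 0.566
```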

In this project we intend to apply the knowledge gained in the course to understand how statistical methods can be used to develop a predictive model in sports analytics. We will use R to first clean and prepare the data, study the distribution of each individual predictor, and identify any possible outliers. Then we will investigate how significant our predictors are for predicting the wins of a baseball team and the batting average of a player. We will use model selection techniques based on AIC and BIC, comparing candidate multiple linear and/or polynomial regression models by adjusted R squared and LOOCV RMSE. At the end we will investigate the criteria that help a baseball team make the playoffs, the postseason, or a championship.

Description of the dataset

For this project we consider two sets of data:

1) MLB Statistics 1962-2012 In the early 2000s, Billy Beane and Paul DePodesta worked for the Oakland Athletics. With the help of statistical analysis, they changed the game of baseball drastically. They realized that statistics like on-base percentage and slugging percentage are very important when it comes to predicting runs scored. They could predict a player’s performance and recruit him on a small budget before other team managers even considered him. This dataset contains some of the information that was available to Beane and DePodesta in the early 2000s: 1232 cases with 15 feature variables. The features are as follows:

  • Team
  • League
  • Year
  • RS Runs Scored
  • RA Runs Allowed
  • W Wins
  • OBP On-Base Percentage
  • SLG Slugging Percentage
  • BA Batting Average
  • Playoffs
  • RankSeason
  • RankPlayoffs
  • G Games Played
  • OOBP Opponent On-Base Percentage
  • OSLG Opponent Slugging Percentage
MLBData = read.csv("MLB.csv")
head(MLBData)
##   Team League Year  RS  RA  W   OBP   SLG    BA Playoffs RankSeason
## 1  ARI     NL 2012 734 688 81 0.328 0.418 0.259        0         NA
## 2  ATL     NL 2012 700 600 94 0.320 0.389 0.247        1          4
## 3  BAL     AL 2012 712 705 93 0.311 0.417 0.247        1          5
## 4  BOS     AL 2012 734 806 69 0.315 0.415 0.260        0         NA
## 5  CHC     NL 2012 613 759 61 0.302 0.378 0.240        0         NA
## 6  CHW     AL 2012 748 676 85 0.318 0.422 0.255        0         NA
##   RankPlayoffs   G  OOBP  OSLG
## 1           NA 162 0.317 0.415
## 2            5 162 0.306 0.378
## 3            4 162 0.315 0.403
## 4           NA 162 0.331 0.428
## 5           NA 162 0.335 0.424
## 6           NA 162 0.319 0.405

2) Major league baseball player statistics data

This dataset contains 4535 cases with 50 feature variables for a group of individual baseball players during the years 1959-2004; it was obtained from the Lahman Baseball Database.

  • YEAR
  • YRINDEX Year index (1959=1)
  • PLAYERID Unique player ID
  • NAMElast Last name
  • NAMEfirst First name
  • TEAM Team(s) played on that year
  • LG League(s) played in that year
  • LGCODE League code (0=AL,1=both, 2=NL)
  • G Games
  • AB At bats (number of turns at bat)
  • R Runs (player advances around all 4 bases and scores)
  • H Hits (player hits a fair ball and reaches base safely without losing a runner already on base)
  • HR Home runs (balls that fly out of the ballpark and allow the batter to run all the way home)
  • RBI Runs batted in (number of runners who score on the basis of a player’s hits)
  • TB Total bases (=H+DBL+2TR+3HR)
  • OB On base (=H+BB+HBP)
  • PA Plate appearances (=AB+BB+HBP+SH+SF)
  • DBL Doubles (long hits that allow the batter to run all the way to second base)
  • TR Triples (very long hits that allow the batter to run all the way to third base)
  • SB Stolen bases (runner “steals” the next base by running to it when no one is looking)
  • CS Caught stealing (runner is tagged out with the ball while attempting to steal a base)
  • BB Bases on balls (player “walks” to first base if the pitcher throws 4 bad pitches)
  • SO Struck out (player is out by swinging-and-missing and/or not-swinging-at-good-pitch 3 times)
  • IBB Intentional bases on balls (pitcher deliberately throws 4 bad pitches to avoid risking a hit)
  • HBP Hit by pitch (player gets to go to first base on the basis of getting hit by pitch)
  • SH Sacrifices (player hits the ball and gets himself out but enables another runner to advance)
  • SF Sacrifice flies (sacrifices that are balls caught in the air in the outfield)
  • GIDP Grounded into double play (player hits a ball that gets himself out and also another runner)
  • AVG Batting average (=H/AB)
  • OBP On base percentage (=OB/PA)
  • SLG Slugging percentage (=TB/AB)
  • AVGcum Cumulative batting average of the same player (in his career since 1959)
  • OBPcum Cumulative on-base percentage
  • SLGcum Cumulative slugging percentage
  • ABcum Cumulative total at-bats
  • Rcum Cumulative total runs
  • Hcum Cumulative total hits
  • HRcum Cumulative total home runs
  • RBIcum Cumulative total runs batted in
  • PAcum Cumulative total plate appearances
  • OBcum Cumulative total of total-on-base
  • TBcum Cumulative total of total-bases
  • EXP Experience (# years after and including the first year in which CPA>50)
  • PAYR Plate appearances per year of experience (since 1959)
  • MLAVG Major league batting average (same year, same set of players)
  • MLOBP Major league on base percentage
  • MLSLG Major league slugging average
  • MLRAVG Major league average runs per plate appearance
  • MLHRAVG Major league average home runs per plate appearance
  • MLRBIAVG Major league average RBI’s per plate appearance
LahmanData = read.csv("LahmanBaseballDatabase.csv")
head(LahmanData)
##   YEAR YRINDEX  PLAYERID NAMElast NAMEfirst TEAM LG LGCODE   G  AB   R   H HR
## 1 1960       2 aaronha01    Aaron      Hank  ML1 NL      2 153 590 102 172 40
## 2 1961       3 aaronha01    Aaron      Hank  ML1 NL      2 155 603 115 197 34
## 3 1962       4 aaronha01    Aaron      Hank  ML1 NL      2 156 592 127 191 45
## 4 1963       5 aaronha01    Aaron      Hank  ML1 NL      2 161 631 121 201 44
## 5 1964       6 aaronha01    Aaron      Hank  ML1 NL      2 145 570 103 187 24
## 6 1965       7 aaronha01    Aaron      Hank  ML1 NL      2 150 570 109 181 32
##   RBI  TB  OB  PA DBL TR SB CS BB SO IBB HBP SH SF GIDP   AVG   OBP   SLG EXP
## 1 126 334 234 664  20 11 16  7 60 63  13   2  0 12    8 0.292 0.352 0.566   2
## 2 120 358 255 670  39 10 21  9 56 64  20   2  1  9   16 0.327 0.381 0.594   3
## 3 128 366 260 667  28  6 15  7 66 73  14   3  0  6   14 0.323 0.390 0.618   4
## 4 130 370 279 714  29  4 31  5 78 94  18   0  0  5   11 0.319 0.391 0.586   5
## 5  95 293 249 634  30  2 22  4 62 46   9   0  0  2   22 0.328 0.393 0.514   6
## 6  89 319 242 639  40  1 24  4 60 81  10   1  0  8   15 0.318 0.379 0.560   7
##   PAYR MLAVG MLOBP MLSLG AVGcum OBPcum SLGcum ABcum Rcum Hcum HRcum RBIcum
## 1  598 0.265 0.335 0.407  0.324  0.377  0.602  1219  218  395    79    249
## 2  598 0.268 0.340 0.418  0.325  0.378  0.599  1822  333  592   113    369
## 3  598 0.268 0.337 0.412  0.324  0.381  0.604  2414  460  783   158    497
## 4  598 0.255 0.319 0.389  0.323  0.383  0.600  3045  581  984   202    627
## 5  598 0.261 0.324 0.396  0.324  0.385  0.587  3615  684 1171   226    722
## 6  598 0.255 0.322 0.389  0.323  0.384  0.583  4185  793 1352   258    811
##   PAcum OBcum TBcum G_Lag1 AB_Lag1 R_Lag1 H_Lag1 HR_Lag1 RBI_Lag1 TB_Lag1
## 1  1357   512   734    154     629    116    223      39      123     400
## 2  2027   767  1092    153     590    102    172      40      126     334
## 3  2694  1027  1458    155     603    115    197      34      120     358
## 4  3408  1306  1828    156     592    127    191      45      128     366
## 5  4042  1555  2121    161     631    121    201      44      130     370
## 6  4681  1797  2440    145     570    103    187      24       95     293
##   OB_Lag1 PA_Lag1 DBL_Lag1 TR_Lag1 SB_Lag1 CS_Lag1 BB_Lag1 SO_Lag1 IBB_Lag1
## 1     278     693       46       7       8       0      51      54       17
## 2     234     664       20      11      16       7      60      63       13
## 3     255     670       39      10      21       9      56      64       20
## 4     260     667       28       6      15       7      66      73       14
## 5     279     714       29       4      31       5      78      94       18
## 6     249     634       30       2      22       4      62      46        9
##   HBP_Lag1 SH_Lag1 SF_Lag1 GIDP_Lag1 AVG_Lag1 OBP_Lag1 SLG_Lag1 EXP_Lag1
## 1        4       0       9        19    0.355    0.401    0.636        1
## 2        2       0      12         8    0.292    0.352    0.566        2
## 3        2       1       9        16    0.327    0.381    0.594        3
## 4        3       0       6        14    0.323    0.390    0.618        4
## 5        0       0       5        11    0.319    0.391    0.586        5
## 6        0       0       2        22    0.328    0.393    0.514        6
##   AVGcum_Lag1 OBPcum_Lag1 SLGcum_Lag1 ABcum_Lag1 Rcum_Lag1 Hcum_Lag1 HRcum_Lag1
## 1       0.355       0.401       0.636        629       116       223         39
## 2       0.324       0.377       0.602       1219       218       395         79
## 3       0.325       0.378       0.599       1822       333       592        113
## 4       0.324       0.381       0.604       2414       460       783        158
## 5       0.323       0.383       0.600       3045       581       984        202
## 6       0.324       0.385       0.587       3615       684      1171        226
##   RBIcum_Lag1 PAcum_Lag1 OBcum_Lag1 TBcum_Lag1
## 1         123        693        278        400
## 2         249       1357        512        734
## 3         369       2027        767       1092
## 4         497       2694       1027       1458
## 5         627       3408       1306       1828
## 6         722       4042       1555       2121
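As a sanity check on the codebook, the derived columns can be reproduced from the raw counting stats. Here we verify the identities on the first row printed above (Hank Aaron, 1960), using the conventional definition PA = AB + BB + HBP + SH + SF:

```r
# First row of LahmanData shown above (Aaron, 1960)
H = 172; DBL = 20; TR = 11; HR = 40; BB = 60; HBP = 2; SH = 0; SF = 12; AB = 590

TB = H + DBL + 2 * TR + 3 * HR   # total bases
OB = H + BB + HBP                # times on base
PA = AB + BB + HBP + SH + SF     # plate appearances

c(TB = TB, OB = OB, PA = PA)     # matches 334, 234, 664 in the printout
```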

Methods

Creating Functions and loading necessary libraries

In this section we create the functions needed later to analyze the data. To keep things organized, all of the necessary libraries are also gathered and loaded here.

Functions

diagnostics() functions

# make_conf_mat() function--------------------------
### create function to generate the Confusion Matrix for a classifier
make_conf_mat = function(predicted, actual) {
  table(predicted = predicted, actual = actual)
}
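A quick hypothetical usage of make_conf_mat() (repeating the definition so the snippet runs standalone), with overall accuracy computed from the resulting table:

```r
make_conf_mat = function(predicted, actual) {
  table(predicted = predicted, actual = actual)
}

# hypothetical predicted vs. actual class labels
predicted = c(1, 0, 1, 1, 0, 0)
actual    = c(1, 0, 0, 1, 0, 1)
(conf = make_conf_mat(predicted, actual))

# overall accuracy: correct classifications lie on the diagonal
sum(diag(conf)) / sum(conf)   # 4 / 6
```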

# diagnostics() function--------------------------
### create function to generate fitted vs. residuals plot and normal qq-plot
diagnostics = function(model, pcol = "grey", lcol = "dodgerblue", alpha = 0.05, plotit = TRUE, testit = TRUE){
  if(plotit) {
    par(mfrow = c(1,2))
    plot(fitted(model), resid(model), col = pcol, pch = 20,
    xlab = "Fitted", ylab = "Residuals", main = "Fitted versus Residuals")
    abline(h = 0, col = lcol, lwd = 2)
    
    qqnorm(resid(model), main = "Normal Q-Q Plot", col = pcol)
    qqline(resid(model), col = lcol, lwd = 2)
  }
 if(testit) {
   results = list(p_val = 0.0, decision = "reject")
   results$p_val = shapiro.test(resid(model))$p.value
   if(results$p_val < alpha){
     results$decision = "Reject"
   } else {
     results$decision = "Fail to Reject."
   }
   return(results)
 }
}


### `outlier_diagnostic()` function
# get_influence() function-----------------------------------
### create function to compute influence for all observations

get_influence = function(model){
  influence = cooks.distance(model)
  return(influence)
}


# hi_influence_index() function-------------------------
### create function to index high influence observations

hi_influence_index = function(model){
# compute influence for all observations
  influence = get_influence(model)
# index high influence observations
  hi_influence_index = influence > 4 / length(influence)
  return(hi_influence_index)
}
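As an illustration of the 4/n heuristic on Cook's distance, here is the function applied to a simple model on the built-in mtcars data (an example of ours, not part of the project data):

```r
# hi_influence_index() repeated here so the snippet runs standalone
hi_influence_index = function(model){
  influence = cooks.distance(model)
  influence > 4 / length(influence)
}

# flag high-influence cars in a simple mileage model
fit = lm(mpg ~ wt, data = mtcars)
idx = hi_influence_index(fit)
sum(idx)                # how many observations are flagged
rownames(mtcars)[idx]   # which cars they are
```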


# outlier_count() function-------------------------------------
### create function to count the number of outliers for a model

outlier_count = function(hi_influence_index){
# count number of high influence observations
  outlier_count = sum(hi_influence_index)
  return(outlier_count)
}


# outlier_remove() function------------------------------------------
### create function to remove significant outliers from training data
    ### remove observations on the basis of influence (cook's distance) only

outlier_remove = function(hi_influence_index, train_set){
# subset train data to exclude high influence observations
# (hi_influence_index is a logical vector, so negate it rather than using "-")
  clean_set = train_set[!hi_influence_index, ]
  return(clean_set)
}


# outlier_diagnostic() function----------------------------------------------------
### create function to package the functionalities of outlier_count / outlier_remove
### enriching/processing raw data

outlier_diagnostic = function(model, train_set = NULL){
  # index high influence observations
  hi_influence = hi_influence_index(model)
  # count number of outliers
  n_outliers = outlier_count(hi_influence)
  # refit branch: also remove significant outliers from training data
  if(!is.null(train_set)){
    clean_set = outlier_remove(hi_influence, train_set)
    return(list(outlier_count = n_outliers, clean_set = clean_set))
  }
  # diagnostic-only branch
  list(outlier_count = n_outliers)
}

goodness_of_fit() functions

# get_loocv_rmse() function--------------
### create function to compute loocv rmse

get_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
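The leverage-based shortcut above avoids refitting the model n times. As a sanity check, it agrees exactly with explicit leave-one-out refitting for linear models; the example below uses the built-in mtcars data with an arbitrary model.

```r
# shortcut: LOOCV RMSE from residuals and leverages (no refitting)
get_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}

fit = lm(mpg ~ wt + hp, data = mtcars)

# explicit leave-one-out: refit without row i, then predict row i
n = nrow(mtcars)
errs = sapply(seq_len(n), function(i) {
  fit_i = lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(fit_i, mtcars[i, ])
})
loocv_explicit = sqrt(mean(errs ^ 2))

all.equal(get_loocv_rmse(fit), loocv_explicit)  # TRUE for linear models
```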


# get_AIC() function----------------
### create function to calculate AIC

get_AIC= function(model) {
  AIC(model)
}


# get_BIC() function----------------
### create function to calculate BIC

get_BIC= function(model) {
  BIC(model)
}


# get_adj_r2() function--------------------------
### create function to extract adjusted r-squared

get_adj_r2 = function(model) {
  summary(model)$adj.r.squared
}


# rmse() function-----------------------------------------------
### create function to compute rmse given actual and fitted data

rmse = function(y_actual, y_hat, n){
  e = y_actual - y_hat
  sse = sum(e ^ 2)
  rmse = sqrt(sse / n)
  return(rmse)
}

# get_test_rmse() function------------------
### create function to compute test set rmse

get_test_rmse = function(model, test_set, response) {
# predict responses using the fitted model on test set predictor data
  y_hat = as.vector(predict(model, test_set))
# store actual responses in vector
  y_actual = as.vector(test_set[, response])
# extract number of observation
  n = length(y_actual)
# compute rmse
  rmse(y_actual, y_hat, n)
}


# goodness_of_fit() function------------------------------
### create function to compute all goodness of fit measures

goodness_of_fit = function(model, test_set, response) {
# get loocv rmse
  loocv_rmse = get_loocv_rmse(model)
# get AIC
  AIC = get_AIC(model)
# get BIC
  BIC = get_BIC(model)
# get adjusted r-squared
  adj_r2 = get_adj_r2(model)
# get test set rmse
  test_rmse = get_test_rmse(model, test_set, response)
# format output
  out = list(loocv_rmse = loocv_rmse, AIC = AIC, BIC = BIC, adj_r2 = adj_r2, test_rmse = test_rmse)
  return(out)
}

test_collinearity() functions

# test_decision_VIF() function------------------------------------------------------
### create function to make a statistical decision on the basis of a given threshold

test_decision_VIF = function(p, threshold){
# decision branch (isTRUE() guards against NA comparisons)
  if(isTRUE(p > threshold)){
# reject null
    return("Reject")
  } else {
# fail to reject null
    return("Fail to Reject")
  }
}


# get_VIF() function------------------------------------------------------------------------
### create function to compute the variance inflation factors for all est. beta coefficients

get_VIF = function(model) {
  vif(model)
}


# test_collinearity() function---------------------------------------------------------
### create function to extract max VIF and determine whether collinearity is an issue
### "Reject" = collinearity is an issue, "Fail to Reject" = collinearity is not an issue

VIF_conventional_alpha = 5

test_collinearity = function(model, alpha = VIF_conventional_alpha) {
# compute VIFs
  vif = get_VIF(model)
# extract max VIF
  max_vif = max(abs(vif))
# determine whether collinearity is an issue
  collinearity = test_decision_VIF(max_vif, alpha)
# format output
  out = list(max_vif = max_vif, collinearity = collinearity)
  return(out)
}

resid_diagnostic() function

# test_decision() function------------------------------------------------------
### create function to make a statistical decision on the basis of a given alpha

test_decision = function(p, alpha){
# decision branch (isTRUE() guards against NA comparisons)
  if(isTRUE(p < alpha)){
# reject null
    return("Reject")
  } else {
# fail to reject null
    return("Fail to Reject")
  }
}


# test_normality() function----------------------------------------------------
### create function to perform test for the normality of fitted model residuals
### "Reject" = normality is suspect, "Fail to Reject" = normality is not suspect

test_normality = function(model){
  test = shapiro.test(resid(model))
  p = test$p.value
  return(p)
}


# test_constantvar() function---------------------------------------------------------
### create function to perform test for the homoscedasticity of fitted model residuals
### "Reject" = homoscedasticity is suspect, "Fail to Reject" = homoscedasticity is not suspect

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
test_constantvar = function(model){
  test = bptest(model)
  p = as.vector(test$p.value)
  return(p)
}


# test_diagnostic() function---------------------------------------------------------
### create function to package the functionality of test_normality / test_constantvar

test_diagnostic = function(model, alpha){
# perform Shapiro-Wilk normality test
  p_norm = test_normality(model = model)
# determine statistical decision arrived at through S-W test
  decision_norm = test_decision(p = p_norm, alpha = alpha)
# perform Breusch-Pagan test for constant variance
  p_convar = test_constantvar(model = model)
# determine statistical decision arrived at through B-P test
  decision_convar = test_decision(p = p_convar, alpha = alpha)
# format output
  out = list(p_norm   = p_norm  , decision_norm   = decision_norm,
             p_convar = p_convar, decision_convar = decision_convar)
  return(out)
}


# resid_diagnostic() function---------------------------------------------------------
### create function to package the functionality of  plot_qq/ plot_fvr / test_diagnostic

resid_conventional_alpha = 0.05

resid_diagnostic = function(model, pcol = "slategrey", lcol = "red", alpha = resid_conventional_alpha, 
                           plot_qq = FALSE, plot_fvr = FALSE, testit = FALSE){
  # store plot_qq output
    qq = function(){plot_qq(model, pcol, lcol)}
  # store plot_fvr output
    fvr = function(){plot_fvr(model, pcol, lcol)}
    
  # output scheme
  if(testit) {
    return(test_diagnostic(model, alpha))
  }
  if(plot_qq) {
    qq()
  }
  if(plot_fvr) {
    fvr()
  }
}  

Libraries

Exploring the datasets

1) MLB Statistics 1962-2012

  • Some grouping of the dataset

  • Teams that made the playoffs

MLBData$RunDiff = MLBData$RS - MLBData$RA
playOff     = MLBData[which(!is.na(MLBData$RankPlayoffs)), ]
notPlayOff  = MLBData[which(is.na(MLBData$RankPlayoffs)), ]

As mentioned, the MLB dataset has 15 feature variables; to create a predictive model for this project, we considered 9 of them.

To get insight into the MLB dataset and its feature variables, let's first take a look at the distribution of each variable:

pairs(MLBData, col="blue")

  • Let's also check the variation of some of the most important predictors:

par(mfrow = c(1, 2))
boxplot(MLBData[c(4, 5, 6)],
        pch = 20, cex = 2, outline = TRUE,
        col = "pink", border = "purple")
boxplot(MLBData[c(7, 8, 9, 14, 15)],
        pch = 20, cex = 2, outline = TRUE,
        col = "pink", border = "purple")

  • To predict the win or playoff situation of a team, four predictors are of most importance: RA, RS, SLG, and OBP. Let's take a look at their distributions:
par(mfrow = c(2, 2))

hist(MLBData$RA,
     xlab = "Runs Allowed (RA)", breaks = 30, col = "darkgreen",
     main = "", prob = TRUE, border = "darkorange")
axis(side = 1, at = seq(400, 1200, 100))
axis(side = 2, at = seq(0.000, 0.005, 0.001))
abline(v = mean(MLBData$RA), lty = 2, lwd = 4, col = "red")
curve(dnorm(x, mean = mean(MLBData$RA), sd = sd(MLBData$RA)),
      col = "red", add = TRUE, lwd = 3)
grid(lwd = 1)

hist(MLBData$RS,
     xlab = "Runs Scored (RS)", breaks = 30, col = "purple",
     main = "", prob = TRUE, border = "darkorange")
abline(v = mean(MLBData$RS), lty = 2, lwd = 4, col = "red")
axis(side = 1, at = seq(400, 1200, 100))
axis(side = 2, at = seq(0.000, 0.005, 0.001))
curve(dnorm(x, mean = mean(MLBData$RS), sd = sd(MLBData$RS)),
      col = "red", add = TRUE, lwd = 3)
grid(lwd = 1)

hist(MLBData$OBP,
     xlab = "On Base Percentage (OBP)", breaks = 30, col = "darkblue",
     main = "", prob = TRUE, border = "darkorange")
abline(v = mean(MLBData$OBP), lty = 2, lwd = 4, col = "red")
axis(side = 1, at = seq(0.24, 0.4, 0.04))
axis(side = 2, at = seq(0, 30, 5))
curve(dnorm(x, mean = mean(MLBData$OBP), sd = sd(MLBData$OBP)),
      col = "red", add = TRUE, lwd = 3)
grid(lwd = 1)

hist(MLBData$SLG,
     xlab = "Slugging Percentage (SLG)", breaks = 30, col = "dodgerblue",
     main = "", prob = TRUE, border = "darkorange")
axis(side = 2, at = seq(0, 14, 2))
abline(v = mean(MLBData$SLG), lty = 2, lwd = 4, col = "red")
curve(dnorm(x, mean = mean(MLBData$SLG), sd = sd(MLBData$SLG)),
      col = "red", add = TRUE, lwd = 3)
grid(lwd = 1)

  • As can be seen, all of them are approximately normally distributed; RA and RS appear slightly right-skewed.

  • Although according to baseball analysts a team requires at least 99 wins to make the playoffs, we checked that rule here; judging from the plots below, Slugging Percentage (SLG) appears to be a better criterion, with less error, than the 99-win cutoff alone.

par(mfrow = c(3,2)) 
plot(RS ~ W, data = playOff, xlab = "Team Wins", ylab = "Runs scored (RS)", main = "Team wins VS. Runs Scored", pch = 17, cex = 2, xlim = c(min(playOff$W, notPlayOff$W), max(playOff$W, notPlayOff$W)), ylim = c(min(playOff$RS, notPlayOff$RS), max(playOff$RS, notPlayOff$RS)), col = "darkred")
points(RS ~ W, data = notPlayOff, col='blue', pch = 2, cex = 2) 
legend("topleft", c("play off", "not play off"), pch = c(17, 2), col = c("darkred", "blue"))
abline(v = 99, lty = 2 ,lwd = 4,col = "red")

plot(RA ~ W, data = playOff, xlab = "Team Wins", ylab = "Runs allowed (RA)", main = "Team wins VS. Runs Allowed", pch = 16, cex = 2, xlim = c(min(playOff$W, notPlayOff$W), max(playOff$W, notPlayOff$W)), ylim = c(min(playOff$RA, notPlayOff$RA), max(playOff$RA, notPlayOff$RA)), col = "darkred")
points(RA ~ W, data = notPlayOff, col='blue', pch = 1, cex = 2) 
legend("topleft", c("play off", "not play off"), pch = c(16, 1), col = c("darkred", "blue"))
abline(v = 99, lty = 2 ,lwd = 4,col = "red")

plot(OBP ~ W, data = playOff, xlab = "Team Wins", ylab = "On Base Percentage (OBP)", main = "Team wins VS. On Base Percentage", pch = 18, cex = 2, xlim = c(min(playOff$W, notPlayOff$W), max(playOff$W, notPlayOff$W)), ylim = c(min(playOff$OBP, notPlayOff$OBP), max(playOff$OBP, notPlayOff$OBP)), col = "darkred")
points(OBP ~ W, data = notPlayOff, col='blue', pch = 3, cex = 2) 
legend("topleft", c("play off", "not play off"), pch = c(18, 3), col = c("darkred", "blue"))
abline(v = 99, lty = 2 ,lwd = 4,col = "red")

plot(SLG ~ W, data = playOff, xlab = "Team Wins", ylab = "Slugging Percentage (SLG)", main = "Team wins VS. Slugging Percentage", pch = 19, cex = 2, xlim = c(min(playOff$W, notPlayOff$W), max(playOff$W, notPlayOff$W)), ylim = c(min(playOff$SLG, notPlayOff$SLG), max(playOff$SLG, notPlayOff$SLG)), col = "darkred")
points(SLG ~ W, data = notPlayOff, col='blue', pch = 4, cex = 2) 
legend("topleft", c("play off", "not play off"), pch = c(19, 4), col = c("darkred", "blue"))
abline(v = 99, lty = 2 ,lwd = 4,col = "red")

plot(BA ~ W, data = playOff, xlab = "Team Wins", ylab = "Batting Average (BA)", main = "Team wins VS. Batting Average", pch = 20, cex = 2, xlim = c(min(playOff$W, notPlayOff$W), max(playOff$W, notPlayOff$W)), ylim = c(min(playOff$BA, notPlayOff$BA), max(playOff$BA, notPlayOff$BA)), col = "darkred")
points(BA ~ W, data = notPlayOff, col='blue', pch = 5, cex = 2) 
legend("topleft", c("play off", "not play off"), pch = c(20, 5), col = c("darkred", "blue"))
abline(v = 99, lty = 2 ,lwd = 4,col = "red")

  • Also, instead of the original feature variables RS and RA, if we consider the difference between them (RunDiff = RS - RA) we can predict team wins much better with a linear predictor, so we added this feature to the dataset.
model = lm(RunDiff ~ W, data = MLBData)
plot(RunDiff ~ W, data = playOff, xlab = "Team Wins", ylab = "RS-RA", main = "Team wins VS. Difference between RS and RA", pch = 20, cex = 2, xlim = c(min(playOff$W, notPlayOff$W), max(playOff$W, notPlayOff$W)), ylim = c(min(playOff$RunDiff, notPlayOff$RunDiff), max(playOff$RunDiff, notPlayOff$RunDiff)), col = "darkred")
points(RunDiff ~ W, data = notPlayOff, col='blue', pch = 5, cex = 2) 
legend("topleft", c("play off", "not play off", "fitted linear line"), pch = c(20, 5, NA), lty = c(NA, NA, 1), lwd = c(NA, NA, 4), col = c("darkred", "blue", "orange"))
abline(v = 99, lty = 2 ,lwd = 4,col = "red")
abline(model, lty = 1 ,lwd = 4, col = "orange")
grid(lwd=1)
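The roughly linear relationship between run differential and wins can be illustrated on simulated seasons (synthetic data of ours, not the MLB file): if wins are generated as about 81 plus one tenth of RunDiff, a commonly cited rule of thumb, lm() recovers that slope.

```r
set.seed(1)
# synthetic seasons: wins ~ 81 + 0.1 * run differential plus noise
RunDiff = round(rnorm(200, mean = 0, sd = 100))
W = round(81 + 0.1 * RunDiff + rnorm(200, sd = 3))
sim = data.frame(W = W, RunDiff = RunDiff)

fit = lm(W ~ RunDiff, data = sim)
coef(fit)   # intercept near 81, slope near 0.1
```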

2) Major league baseball player statistics data

Clean Up Data

Baseball_stats = LahmanData
# Remove cumulative and lagged columns (columns 37 onward)
Baseball_stats = subset(Baseball_stats, select = -c(37:ncol(Baseball_stats)))

# Remove unnecessary and duplicate data
Baseball_stats = subset(Baseball_stats, select = -c(YEAR, PLAYERID, NAMElast, NAMEfirst, LG))

# Coerce categorical variables into factors
Baseball_stats$YRINDEX = as.factor(Baseball_stats$YRINDEX)
Baseball_stats$TEAM = as.factor(Baseball_stats$TEAM)
Baseball_stats$LGCODE = as.factor(Baseball_stats$LGCODE)

Setting Up Train/Test Data

set.seed(420)
train_index  = sample(nrow(Baseball_stats), size = trunc(0.80 * nrow(Baseball_stats)))
train_set = Baseball_stats[train_index, ]
test_set = Baseball_stats[-train_index, ]

Building Models

Classification Model to predict teams make to the playoff

  • In this section we try to build a classifier to predict whether a team makes the playoffs or not.

First we divide the data into train and test sets:

# Test-Train Split
set.seed(42)
MLBData_B = MLBData[-c(1, 6, 11, 12, 13, 14, 15)]
MLB_idx = sample(nrow(MLBData_B), 800)
MLBData_trn = MLBData_B[MLB_idx, ]
MLBData_tst = MLBData_B[-MLB_idx, ]

We considered four different models for our classifier:

  1. An additive model with just four predictors: RS + RA + OBP + SLG
model_playoff_additive_small = glm(Playoffs ~ RS + RA + OBP + SLG, data = MLBData_trn, family = "binomial")

Then we calculate the variance inflation factor (VIF) for each of the predictors:

vif(model_playoff_additive_small)
##     RS     RA    OBP    SLG 
## 187.09  61.68  77.02  92.49

All of these are greater than 5, but we still decided to keep them as is.

  2. An additive model with all predictors except Team, RankSeason, RankPlayoffs, and G, because they are not obviously relevant; for example, G is the number of games a team played, so a higher G means more games played, which itself implies the team made the playoffs.
model_playoff_additive_All   = glm(Playoffs ~ . , data = MLBData_trn, family = "binomial", maxit = 2000)

Then we calculate the variance inflation factor (VIF) for each of the predictors:

vif(model_playoff_additive_All )
##  LeagueNL      Year        RS        RA       OBP       SLG        BA   RunDiff 
##     15.88     23.06    239.32     72.48    106.53    130.12     61.49 704776.57
  • So, as expected, RunDiff has a very large VIF because it is equal to RS - RA. Since RunDiff predicts the response better than RA and RS (as shown in the Exploring the Datasets section), we decided to omit RS and RA:
MLBData_trn_NO_RA_RS = MLBData_trn[-c(3, 4)]
model_playoff_additive_All   = glm(Playoffs ~ ., data = MLBData_trn_NO_RA_RS, family = "binomial", maxit = 2000)
vif(model_playoff_additive_All )
## LeagueNL     Year      OBP      SLG       BA  RunDiff 
##    15.65    22.38    66.82    55.98    61.24    87.76
  • As can be seen, the VIFs decreased drastically.

3, 4) We also considered the model with all available predictors and their two-way interactions, and a model chosen from it via backward selection using AIC:

model_playoff_interactive_all = glm(Playoffs ~ .:., data = MLBData_trn_NO_RA_RS, family = "binomial", maxit = 200)
model_playoff_AIC = step(model_playoff_interactive_all, direction = "backward", trace = 0, maxit = 200)

  • For each model we obtained a 5-fold cross-validated misclassification rate, using each model as a classifier that seeks to minimize the misclassification rate:

set.seed(12)
cv.glm(MLBData_trn, model_playoff_additive_small, K = 5)$delta[1]
## [1] 0.08261
cv.glm(MLBData_trn, model_playoff_additive_All, K = 5)$delta[1]
## [1] 0.07932
cv.glm(MLBData_trn, model_playoff_interactive_all, K = 5)$delta[1]
## [1] 0.09127
cv.glm(MLBData_trn, model_playoff_AIC, K = 5)$delta[1]
## [1] 0.07631
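Note that by default cv.glm's delta is the average squared error between the response and the fitted probabilities; to cross-validate a 0/1 misclassification rate directly, a cost function can be supplied. A sketch on synthetic data (names illustrative):

```r
library(boot)  # for cv.glm()

set.seed(12)
n = 200
x = rnorm(n)
y = rbinom(n, 1, plogis(2 * x))
dat = data.frame(y, x)
fit = glm(y ~ x, data = dat, family = "binomial")

# cost(observed, fitted probability): 0/1 loss at a 0.5 cutoff
miscost = function(y, prob) mean(abs(y - prob) > 0.5)
err = cv.glm(dat, fit, cost = miscost, K = 5)$delta[1]
err
```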
  • The best model on the training data was the one chosen by AIC. Let us calculate the misclassification rate on the test data (a linear predictor above 0 corresponds to a predicted probability above 0.5):
# test misclassification rate
(mis1 = mean(ifelse(predict(model_playoff_additive_small, MLBData_tst) > 0, 1, 0) != MLBData_tst$Playoffs))
## [1] 0.1227
(mis2 = mean(ifelse(predict(model_playoff_additive_All, MLBData_tst) > 0, 1, 0) != MLBData_tst$Playoffs))
## [1] 0.09722
(mis3 = mean(ifelse(predict(model_playoff_interactive_all, MLBData_tst) > 0, 1, 0) != MLBData_tst$Playoffs))
## [1] 0.1065
(mis4 = mean(ifelse(predict(model_playoff_AIC, MLBData_tst) > 0, 1, 0) != MLBData_tst$Playoffs))
## [1] 0.09722

On the test data, the AIC model also did a good job, with a test misclassification rate of 0.0972 (tied with model 2).

We also calculate the sensitivity and specificity of each model:

#models Sensitivity
#print(c("sensitivity", "specificity"))
conf_mat = make_conf_mat(predicted = ifelse(predict(model_playoff_additive_small, MLBData_tst) > 0, 1, 0), actual = MLBData_tst$Playoffs)
model1 = (c(conf_mat[2, 2] / sum(conf_mat[, 2]), conf_mat[1, 1] / sum(conf_mat[, 1])))
conf_mat = make_conf_mat(predicted = ifelse(predict(model_playoff_additive_All, MLBData_tst) > 0, 1, 0), actual = MLBData_tst$Playoffs)
model2 = (c(conf_mat[2, 2] / sum(conf_mat[, 2]), conf_mat[1, 1] / sum(conf_mat[, 1])))
conf_mat = make_conf_mat(predicted = ifelse(predict(model_playoff_interactive_all, MLBData_tst) > 0, 1, 0), actual = MLBData_tst$Playoffs)
model3 = (c(conf_mat[2, 2] / sum(conf_mat[, 2]), conf_mat[1, 1] / sum(conf_mat[, 1])))
conf_mat = make_conf_mat(predicted = ifelse(predict(model_playoff_AIC, MLBData_tst) > 0, 1, 0), actual = MLBData_tst$Playoffs)
model4 = (c(conf_mat[2, 2] / sum(conf_mat[, 2]), conf_mat[1, 1] / sum(conf_mat[, 1])))
Model     Sensitivity  Specificity  Misclassification rate
Model 1   0.6133       0.9328       0.1227
Model 2   0.7333       0.9384       0.0972
Model 3   0.6933       0.9356       0.1065
Model 4   0.7333       0.9384       0.0972
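make_conf_mat is a helper defined earlier in the report; assuming it simply tabulates predicted against actual classes, the sensitivity/specificity calculations can be sketched on toy vectors as:

```r
# Sketch of the confusion-matrix helper (the report's actual definition
# may differ); rows = predicted class, columns = actual class.
make_conf_mat = function(predicted, actual) {
  table(predicted = predicted, actual = actual)
}

# sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)
predicted = c(1, 0, 1, 1, 0, 0)
actual    = c(1, 0, 0, 1, 1, 0)
cm = make_conf_mat(predicted, actual)
c(sensitivity = cm[2, 2] / sum(cm[, 2]),
  specificity = cm[1, 1] / sum(cm[, 1]))
# both equal 2/3 here
```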

So model 4 (the AIC model) is the best model; its parameters are as follows:

summary(model_playoff_AIC)
## 
## Call:
## glm(formula = Playoffs ~ Year + OBP + SLG + BA + RunDiff + OBP:SLG + 
##     SLG:BA + SLG:RunDiff, family = "binomial", data = MLBData_trn_NO_RA_RS, 
##     maxit = 200)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5877  -0.3155  -0.0623  -0.0023   3.0718  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -75.7149    43.7319   -1.73   0.0834 .  
## Year            0.0542     0.0113    4.80  1.6e-06 ***
## OBP          -491.7340   236.9478   -2.08   0.0380 *  
## SLG          -101.6760   100.8329   -1.01   0.3133    
## BA            478.4068   276.0587    1.73   0.0831 .  
## RunDiff         0.1245     0.0431    2.89   0.0039 ** 
## OBP:SLG      1228.7460   571.3088    2.15   0.0315 *  
## SLG:BA      -1148.6879   668.2086   -1.72   0.0856 .  
## SLG:RunDiff    -0.2131     0.1025   -2.08   0.0376 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 824.97  on 799  degrees of freedom
## Residual deviance: 369.85  on 791  degrees of freedom
## AIC: 387.9
## 
## Number of Fisher Scoring iterations: 8
  • The SLG predictor does not appear significant, but since the test misclassification rate is low we decided to keep the model as is. Many of the predictors are, in one way or another, correlated with each other; even so, the test misclassification rate of the model is quite low: 0.0972.

Model to predict the runs scored

  • Based on the previous analysis, we conclude that OBP, SLG, BA, and RunDiff are most probably the best predictors for the response RS, the runs scored by a team, so let us give this idea a try:
model_RS = lm(RS ~ (OBP + SLG + RunDiff + BA)^2, data = MLBData)
summary(model_RS)
## 
## Call:
## lm(formula = RS ~ (OBP + SLG + RunDiff + BA)^2, data = MLBData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -71.80 -15.56  -1.16  14.60  82.19 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   557.004    321.825    1.73   0.0837 . 
## OBP          -211.745   1727.957   -0.12   0.9025   
## SLG          1188.363    921.283    1.29   0.1973   
## RunDiff         0.136      0.178    0.77   0.4440   
## BA          -5846.596   2170.615   -2.69   0.0072 **
## OBP:SLG     -1678.178   3214.468   -0.52   0.6017   
## OBP:RunDiff     0.172      0.958    0.18   0.8579   
## OBP:BA      13445.007   7026.407    1.91   0.0559 . 
## SLG:RunDiff    -0.848      0.413   -2.05   0.0405 * 
## SLG:BA       3385.207   3569.975    0.95   0.3432   
## RunDiff:BA      0.920      1.189    0.77   0.4390   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.4 on 1221 degrees of freedom
## Multiple R-squared:  0.935,  Adjusted R-squared:  0.934 
## F-statistic: 1.75e+03 on 10 and 1221 DF,  p-value: <2e-16
  • Again, there seems to be a collinearity problem among the predictors:
vif(model_RS)
##         OBP         SLG     RunDiff          BA     OBP:SLG OBP:RunDiff 
##      1507.5      2104.2       750.0      1758.4      5936.9      2318.5 
##      OBP:BA SLG:RunDiff      SLG:BA  RunDiff:BA 
##      6707.5       645.8      4859.1      2244.8
  • So the two-way interactions seem unnecessary; let us omit them (keeping the main effects and adding Year):
model_RS = lm(RS ~ (OBP + SLG + RunDiff + BA + Year), data = MLBData)
summary(model_RS)
## 
## Call:
## lm(formula = RS ~ (OBP + SLG + RunDiff + BA + Year), data = MLBData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -73.26 -15.90  -1.05  14.09  76.55 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -161.83847  107.74377    -1.5    0.133    
## OBP         2635.18023   94.76710    27.8  < 2e-16 ***
## SLG         1607.77852   39.63857    40.6  < 2e-16 ***
## RunDiff        0.08211    0.00788    10.4  < 2e-16 ***
## BA          -192.44212  107.01886    -1.8    0.072 .  
## Year          -0.28757    0.05638    -5.1  3.9e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.5 on 1226 degrees of freedom
## Multiple R-squared:  0.934,  Adjusted R-squared:  0.934 
## F-statistic: 3.5e+03 on 5 and 1226 DF,  p-value: <2e-16
  • Let us also remove any potential outliers from the data:

Removing Outliers

lev = hatvalues(model_RS )
high_lev = which(lev > 2 * mean(lev))
influential = which(cooks.distance(model_RS ) > 4 / length(cooks.distance(model_RS)))
remove = c(high_lev, influential)
adj_data = MLBData[-unique(remove),]
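The two rules of thumb used here — leverage above twice the mean leverage, and Cook's distance above 4/n — can be seen flagging a planted outlier in a toy fit (synthetic data, not from our dataset):

```r
set.seed(2)
x = c(rnorm(50), 10)                  # one extreme x -> high leverage
y = c(2 * x[1:50] + rnorm(50), -20)   # its y is far off the trend -> influential
toy = lm(y ~ x)

lev  = hatvalues(toy)
cook = cooks.distance(toy)
which(lev > 2 * mean(lev))      # flags the planted observation 51
which(cook > 4 / length(cook))  # flags it as influential too
```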

model_RS = lm(RS ~ (OBP + SLG + RunDiff + BA + Year), data = adj_data)
summary(model_RS)
## 
## Call:
## lm(formula = RS ~ (OBP + SLG + RunDiff + BA + Year), data = adj_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -57.6  -14.7   -0.9   13.3   69.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -134.39335  104.20562   -1.29     0.20    
## OBP         2564.33740   93.05247   27.56  < 2e-16 ***
## SLG         1622.65842   38.72749   41.90  < 2e-16 ***
## RunDiff        0.07945    0.00767   10.36  < 2e-16 ***
## BA          -131.26470  106.61118   -1.23     0.22    
## Year          -0.30116    0.05467   -5.51  4.5e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.9 on 1128 degrees of freedom
## Multiple R-squared:  0.944,  Adjusted R-squared:  0.943 
## F-statistic: 3.78e+03 on 5 and 1128 DF,  p-value: <2e-16
vif(model_RS)
##     OBP     SLG RunDiff      BA    Year 
##   4.641   4.045   1.497   4.340   1.642
  • A good model; let us check the assumptions:
#Model to predict Runs Scored
diagnostics(model_RS, testit = FALSE, pcol = "red", lcol = "black")

  • Perfect!
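diagnostics is a plotting helper defined earlier in the report; assuming it draws a residuals-vs-fitted plot and a normal Q-Q plot, with pcol and lcol controlling point and line colors, it might be sketched as:

```r
# Sketch of the plotting helper; the report's actual definition may differ.
diagnostics = function(model, testit = TRUE, pcol = "grey", lcol = "dodgerblue") {
  par(mfrow = c(1, 2))
  plot(fitted(model), resid(model), col = pcol, pch = 20,
       xlab = "Fitted", ylab = "Residuals", main = "Residuals vs Fitted")
  abline(h = 0, col = lcol, lwd = 2)
  qqnorm(resid(model), col = pcol, pch = 20)
  qqline(resid(model), col = lcol, lwd = 2)
  if (testit) print(shapiro.test(resid(model))$p.value)  # optional numeric check
}

fit = lm(dist ~ speed, data = cars)  # built-in dataset
diagnostics(fit, testit = FALSE, pcol = "red", lcol = "black")
```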
test_diagnostic(model_RS, alpha = 0.01)
## $p_norm
## [1] 0.01129
## 
## $decision_norm
## [1] "Fail to Reject"
## 
## $p_convar
## [1] 0.4482
## 
## $decision_convar
## [1] "Fail to Reject"
  • So at α = 0.01 neither the normality nor the equal-variance assumption is rejected, and the multiple R-squared of the model is 0.944.
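test_diagnostic is a helper defined earlier in the report; assuming it runs a Shapiro–Wilk test on the residuals and a Breusch–Pagan-style test for constant variance, a self-contained base-R sketch might be:

```r
# Sketch of the helper (the report's actual definition may differ):
# Shapiro-Wilk for normality, plus a base-R Breusch-Pagan test
# (Koenker's studentized version) for constant variance.
test_diagnostic = function(model, alpha = 0.01) {
  p_norm = shapiro.test(resid(model))$p.value
  aux  = lm(resid(model)^2 ~ fitted(model))       # auxiliary regression
  stat = length(resid(model)) * summary(aux)$r.squared
  p_convar = pchisq(stat, df = 1, lower.tail = FALSE)
  decide = function(p) if (p < alpha) "Reject" else "Fail to Reject"
  list(p_norm = p_norm, decision_norm = decide(p_norm),
       p_convar = p_convar, decision_convar = decide(p_convar))
}

fit = lm(dist ~ speed, data = cars)  # built-in dataset
test_diagnostic(fit)
```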

Model to predict the runs allowed

  • Let us add two new predictors, OOBP and OSLG, to the previous model and follow similar steps to see whether we can predict a team's runs allowed (RA):
model_RA = lm(RA ~ (OBP + SLG + OOBP + OSLG + RunDiff + BA + Year), data = MLBData)
model_RA = step(model_RA, direction = "backward", k = log(length(resid(model_RA))), trace = 0)
summary(model_RA)
## 
## Call:
## lm(formula = RA ~ OBP + SLG + OOBP + OSLG + RunDiff, data = MLBData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -53.41 -12.04   0.07  13.16  41.59 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -903.1786    28.3945  -31.81   <2e-16 ***
## OBP         1560.8045   129.4516   12.06   <2e-16 ***
## SLG         1064.2653    68.7373   15.48   <2e-16 ***
## OOBP        1178.1588   133.2230    8.84   <2e-16 ***
## OSLG         728.7765    74.0977    9.84   <2e-16 ***
## RunDiff       -0.5698     0.0264  -21.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.6 on 414 degrees of freedom
##   (812 observations deleted due to missingness)
## Multiple R-squared:  0.956,  Adjusted R-squared:  0.955 
## F-statistic: 1.79e+03 on 5 and 414 DF,  p-value: <2e-16

Removing Outliers

lev = hatvalues(model_RA )
high_lev = which(lev > 2 * mean(lev))
influential = which(cooks.distance(model_RA ) > 4 / length(cooks.distance(model_RA)))
remove = c(high_lev, influential)
adj_data = MLBData[-unique(remove),]

model_RA = lm(RA ~ (OBP + SLG + OOBP + OSLG + RunDiff + BA + Year), data = adj_data)
summary(model_RA)
## 
## Call:
## lm(formula = RA ~ (OBP + SLG + OOBP + OSLG + RunDiff + BA + Year), 
##     data = adj_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47.55 -11.61  -0.38  12.09  42.52 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -258.8644   485.5536   -0.53     0.59    
## OBP         1540.1083   139.5848   11.03  < 2e-16 ***
## SLG         1152.0925    68.5232   16.81  < 2e-16 ***
## OOBP        1007.5843   126.1389    7.99  1.8e-14 ***
## OSLG         678.2758    71.8434    9.44  < 2e-16 ***
## RunDiff       -0.6091     0.0252  -24.19  < 2e-16 ***
## BA            48.5060   137.8232    0.35     0.73    
## Year          -0.3040     0.2346   -1.30     0.20    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.9 on 368 degrees of freedom
##   (812 observations deleted due to missingness)
## Multiple R-squared:  0.964,  Adjusted R-squared:  0.964 
## F-statistic: 1.42e+03 on 7 and 368 DF,  p-value: <2e-16
vif(model_RA)
##     OBP     SLG    OOBP    OSLG RunDiff      BA    Year 
##   5.109   4.225   4.934   4.927  10.457   3.279   1.330
#Model to predict Runs Allowed
diagnostics(model_RA, testit = FALSE, pcol = "blue", lcol = "black")

test_diagnostic(model_RA, alpha = 0.01)
## $p_norm
## [1] 0.4026
## 
## $decision_norm
## [1] "Fail to Reject"
## 
## $p_convar
## [1] 0.04836
## 
## $decision_convar
## [1] "Fail to Reject"
  • So it seems we can predict runs scored and runs allowed quite accurately. Let us now try to predict a team's wins.

Model to predict wins

  • Let us try to predict a team's wins, this time using BIC (backward selection with k = log(n)) to find a model:
model_Win = lm(W ~ (OBP + SLG + OOBP + OSLG + RunDiff + BA + Year)^2, data = MLBData)
model_Win = step(model_Win, direction = "backward", k = log(length(resid(model_Win))), trace = 0)
summary(model_Win)
## 
## Call:
## lm(formula = W ~ SLG + OOBP + RunDiff, data = MLBData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11.80  -2.71  -0.29   2.60  12.47 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   98.84458    5.33445   18.53  < 2e-16 ***
## SLG           41.38994   12.49571    3.31    0.001 ** 
## OOBP        -106.12028   23.36561   -4.54  7.3e-06 ***
## RunDiff        0.08627    0.00385   22.43  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.88 on 416 degrees of freedom
##   (812 observations deleted due to missingness)
## Multiple R-squared:  0.89,   Adjusted R-squared:  0.889 
## F-statistic: 1.12e+03 on 3 and 416 DF,  p-value: <2e-16

Removing Outliers

lev = hatvalues(model_Win )
high_lev = which(lev > 2 * mean(lev))
influential = which(cooks.distance(model_Win ) > 4 / length(cooks.distance(model_Win)))
remove = c(high_lev, influential)
adj_data = MLBData[-unique(remove),]

model_Win = lm(W ~ (OBP + SLG + OOBP + OSLG + RunDiff + BA + Year)^2, data = adj_data)
model_Win = step(model_Win, direction = "backward", k = log(length(resid(model_Win))), trace = 0)
summary(model_Win)
## 
## Call:
## lm(formula = W ~ SLG + OOBP + RunDiff, data = adj_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.072 -2.551 -0.256  2.548 11.369 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   97.47098    5.63786   17.29  < 2e-16 ***
## SLG           44.67489   12.52033    3.57  0.00041 ***
## OOBP        -106.42414   23.80684   -4.47    1e-05 ***
## RunDiff        0.08723    0.00392   22.23  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.44 on 374 degrees of freedom
##   (812 observations deleted due to missingness)
## Multiple R-squared:  0.904,  Adjusted R-squared:  0.903 
## F-statistic: 1.17e+03 on 3 and 374 DF,  p-value: <2e-16
vif(model_Win)
##     SLG    OOBP RunDiff 
##   2.770   3.650   5.099

  • A very small model, yet with a high R-squared of 0.904. Although we considered two-way interactions, BIC rejected all of them.

#Model to predict Team Wins
diagnostics(model_Win, testit = FALSE, pcol = "darkgreen", lcol = "black")

test_diagnostic(model_Win, alpha = 0.01)
## $p_norm
## [1] 0.1171
## 
## $decision_norm
## [1] "Fail to Reject"
## 
## $p_convar
## [1] 0.8955
## 
## $decision_convar
## [1] "Fail to Reject"

Model to predict the batting average

  • Removing Outliers
add_model = lm(AVG ~ . , data = train_set)

lev = hatvalues(add_model)
high_lev = which(lev > 2 * mean(lev))
influential = which(cooks.distance(add_model) > 4 / length(cooks.distance(add_model)))
remove = c(high_lev, influential)
adj_data = Baseball_stats[-unique(remove),]

adj_model = lm(AVG ~ ., data = adj_data)
  • Creating The Initial Models
# Backward AIC
add_model_back_AIC = step(adj_model, direction = "backward", trace = 0)

# Backward BIC
add_model_back_BIC = step(adj_model, direction = "backward", k = log(length(resid(add_model))), trace = 0)

Adjusting BIC model

# Adjusting dataframe to only show variables chosen through backwards BIC
variables = tail(names(coef(add_model_back_BIC)), -1)
back_bic_data = subset(train_set, select = variables)

round(cor(back_bic_data), 2)
##        AB     H    HR    TB    OB    PA   DBL    SO    SH  GIDP   OBP   SLG
## AB   1.00  0.85  0.21  0.67  0.67  0.95  0.51  0.15  0.04  0.23  0.10  0.19
## H    0.85  1.00  0.26  0.80  0.80  0.83  0.63  0.02 -0.07  0.22  0.47  0.47
## HR   0.21  0.26  1.00  0.77  0.45  0.34  0.30  0.57 -0.52  0.15  0.40  0.87
## TB   0.67  0.80  0.77  1.00  0.79  0.74  0.66  0.37 -0.37  0.21  0.55  0.85
## OB   0.67  0.80  0.45  0.79  1.00  0.84  0.58  0.20 -0.19  0.13  0.79  0.59
## PA   0.95  0.83  0.34  0.74  0.84  1.00  0.54  0.24 -0.04  0.20  0.33  0.32
## DBL  0.51  0.63  0.30  0.66  0.58  0.54  1.00  0.14 -0.23  0.19  0.39  0.52
## SO   0.15  0.02  0.57  0.37  0.20  0.24  0.14  1.00 -0.31 -0.03  0.09  0.39
## SH   0.04 -0.07 -0.52 -0.37 -0.19 -0.04 -0.23 -0.31  1.00 -0.17 -0.28 -0.51
## GIDP 0.23  0.22  0.15  0.21  0.13  0.20  0.19 -0.03 -0.17  1.00  0.00  0.11
## OBP  0.10  0.47  0.40  0.55  0.79  0.33  0.39  0.09 -0.28  0.00  1.00  0.67
## SLG  0.19  0.47  0.87  0.85  0.59  0.32  0.52  0.39 -0.51  0.11  0.67  1.00
# AB seems to have a relatively high correlation with the other variables, especially PA, so we check how much of AB the other variables can explain
check_model = lm(AB ~ ., data = back_bic_data)
summary(check_model)$r.squared
## [1] 0.9988
# AB's variability is almost entirely explained by the other predictors (R^2 = 0.9988), confirming severe collinearity, so we can drop one of the offending variables; we remove PA, AB's strongest correlate, and refit
adj_bic_mod = lm(AVG ~ AB + H + HR + TB + OB + DBL + SO + SH + GIDP + OBP + SLG, data = adj_data)

# The maximum collinearity went down a lot, but the VIFs are still too high, so we remove other variables that also have high correlations in the correlation matrix and check how the overall VIFs change

adj_bic_mod = lm(AVG ~ H + HR + OB + DBL + SO + SH + GIDP , data = adj_data)

test_collinearity(adj_bic_mod)
## $max_vif
## [1] 3.714
## 
## $collinearity
## [1] "Fail to Reject"
# Collinearity is now acceptable, but we removed a lot of variables, so we have to check whether the goodness of fit still holds up
goodness_of_fit(adj_bic_mod, test_set = test_set, response = "AVG")
## $loocv_rmse
## [1] 0.01701
## 
## $AIC
## [1] -22765
## 
## $BIC
## [1] -22708
## 
## $adj_r2
## [1] 0.6482
## 
## $test_rmse
## [1] 0.01756
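goodness_of_fit is a helper defined earlier in the report; assuming it reports the LOOCV RMSE (via the closed-form leave-one-out residuals e_i / (1 − h_i), so no refitting loop is needed), AIC, BIC, adjusted R², and test RMSE, a sketch might look like:

```r
# Sketch of the helper (the report's actual definition may differ).
goodness_of_fit = function(model, test_set, response) {
  loo   = resid(model) / (1 - hatvalues(model))  # leave-one-out residuals
  preds = predict(model, newdata = test_set)
  list(loocv_rmse = sqrt(mean(loo^2)),
       AIC        = AIC(model),
       BIC        = BIC(model),
       adj_r2     = summary(model)$adj.r.squared,
       test_rmse  = sqrt(mean((test_set[[response]] - preds)^2)))
}

# Usage on a built-in dataset split in half
idx = seq(1, nrow(mtcars), by = 2)
fit = lm(mpg ~ wt + hp, data = mtcars[idx, ])
goodness_of_fit(fit, test_set = mtcars[-idx, ], response = "mpg")
```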
# Now that collinearity is good, the R^2 value went down and the RMSE went up, so we have to fix that; maybe adjusting the set of variables will help
variables = tail(names(coef(adj_bic_mod)), -1)

back_bic_data = subset(train_set, select = variables)
back_bic_data$AVG = train_set$AVG

Guess and Check Adjusting Model

adj_bic_mod = lm(AVG ~ H + HR + DBL + OB + PA, data = adj_data)

summary(adj_bic_mod)
## 
## Call:
## lm(formula = AVG ~ H + HR + DBL + OB + PA, data = adj_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.029682 -0.001768  0.000428  0.002237  0.027503 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.75e-01   5.24e-04  524.27  < 2e-16 ***
## H            1.34e-03   4.67e-06  287.58  < 2e-16 ***
## HR           3.67e-05   6.42e-06    5.72  1.2e-08 ***
## DBL          3.55e-05   1.00e-05    3.54  0.00041 ***
## OB           5.16e-04   3.33e-06  154.79  < 2e-16 ***
## PA          -5.11e-04   1.72e-06 -296.62  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.00403 on 4282 degrees of freedom
## Multiple R-squared:  0.98,   Adjusted R-squared:  0.98 
## F-statistic: 4.25e+04 on 5 and 4282 DF,  p-value: <2e-16
test_collinearity(adj_bic_mod)
## $max_vif
## [1] 4.578
## 
## $collinearity
## [1] "Fail to Reject"
goodness_of_fit(adj_bic_mod, test_set = test_set, response = "AVG")
## $loocv_rmse
## [1] 0.004037
## 
## $AIC
## [1] -35108
## 
## $BIC
## [1] -35064
## 
## $adj_r2
## [1] 0.9802
## 
## $test_rmse
## [1] 0.004042
#Model to predict Batting Average
diagnostics(adj_bic_mod, testit = FALSE, pcol = "purple", lcol = "black")

vif(adj_bic_mod)
##     H    HR   DBL    OB    PA 
## 4.360 1.346 1.752 4.578 4.466
  • As can be seen, the residuals-versus-fitted plot has a somewhat strange shape; however, the other assumptions seem to be satisfied, the VIFs of the parameters are low (maximum = 4.578), and the R-squared of the model is very high (0.98), so we consider the model acceptable.

Conclusion

Citation of datasets